Bulletin of Electrical Engineering and Informatics 
Vol. 12, No. 2, April 2023, pp. 987~996 
ISSN: 2302-9285, DOI: 10.11591/eei.v12i2.4182


Depression detection in social media comments data using 
machine learning algorithms 


Zannatun Nayem Vasha, Bidyut Sharma, Israt Jahan Esha, Jabir Al Nahian, Johora Akter Polin 


Department of Computer Science and Engineering, Daffodil International University, Dhaka, Bangladesh 


Article Info

Article history:
Received Jun 1, 2022
Revised Aug 6, 2022
Accepted Sep 8, 2022

Keywords:
Data mining
Depression detection
Facebook
Machine learning
Social media

ABSTRACT

Depression is the next level of negative emotions. When a person is in a sad mood or going through a
difficult situation that does not go away, causes continuous pain, and can no longer be endured, that
condition is called depression. In its last stage, depression can end in suicide. According to the
World Health Organization (WHO), 4.4% of people in the world are currently suffering from depression.
In 2021, fourteen thousand people committed suicide all over the world, and the suicide rate is
increasing day by day. Our study therefore aims to find depressed people through their comments, posts,
or texts on social media. We collected almost 10,000 data items from Facebook posts, Facebook comments,
and YouTube comments. Data mining and machine learning (ML) algorithms make our work easier and play a
big role in detecting a person's emotions. We applied six classifiers to predict depression and
non-depression and found the best accuracy with a support vector machine (SVM).

This is an open access article under the CC BY-SA license.

Corresponding Author: 


Zannatun Nayem Vasha 

Department of Computer Science & Engineering, Daffodil International University 
Dhaka, Bangladesh 

Email: zannatun15-12939@diu.edu.bd 


1. INTRODUCTION 

Emotions come in many forms, both positive and negative. Positive emotions include love, laughter, and
happiness. Negative emotions include anger, sadness, and depression. Negative emotions can be very
serious and can lead to many deaths [1]. Depression is a common and serious illness that can harm
physical and mental wellbeing. It affects how a person feels, thinks, and acts. Depression is a disorder
of both mental state and activity. Scientifically, various events in personal life, major illness,
chemical imbalance in the brain, and similar factors are blamed for depression. Suicide is a major, and
probably inevitable, public health problem. Depression is the main cause of suicide [2]-[4]. People with
depression suffer not only mental difficulties but also physical problems, and their suicide rate rises
[5]. 80% of people who attempt suicide suffer from depression [6]. Depression can have many causes, such
as physical, sexual, or emotional abuse, drug use, personal conflict, the loss of someone special, and
long-term major illness. Childbirth, menopause, unemployment, stress, low income, hassle, jealousy,
separation, and social rejection can also lead a person to depression.

According to the World Health Organization (WHO), 322 million people around the world suffer from
depression, which is 4.4% of the world's population. People who have lived through the COVID-19 era
remain afraid, and those who suffered from COVID-19 become more depressed. In the last year, 14,436 men
and women committed suicide across the country. Low- and middle-income countries account for 75%, or
three-quarters, of all suicides worldwide. Bangladesh ranks tenth on this list. According to a survey by
the National Institute of Mental Health, an average of 26 people commit suicide every day in the country.
Most of them are women. Women are about twice as likely as men to become depressed. Depression is very
common in the young generation because today's youth stay indoors and are addicted to video games and
mobile phones. Most suicides among men and women occur between 21 and 30 years of age. Depression has
led to a rise in alcohol consumption, and a study found that college students rank much higher on the
list [7]. The more people stay indoors, the more they suffer physically and mentally. As long as a person
is in nature, that person keeps an independent identity, because people draw healing power from nature.
Depression is one of the most common mental disorders, and it keeps us from understanding what joy is
[8]. Depression can be recognized through a personal interview with a psychiatrist and through social
media activity. There are many symptoms of depression, and a person must experience at least four of
them daily or twice a week. These include sleeping problems, digestive problems, fatigue, restlessness,
and thoughts of suicide [9]. Loss of interest in hobbies, weight loss, continuous disappointment,
headaches, and uncontrolled anger are also possible signs of depression [10].

According to WHO, depression means low mood, low energy, and low interest. Researchers are using social
media content, such as comments and related activities, to detect and predict depression. Social networks
such as Facebook, Twitter, YouTube, Instagram, and WhatsApp have become a common platform for sharing
thoughts, feelings, and overall moods in our daily lives. These platforms have become a valuable database
for researchers [11]. People share their experiences, activities, and stories on social media, and those
activities help to detect depression early [12]. Many emotional posts and photos reflecting different
types of depression are shared on social media. Increased severity of depression is well known to raise
the risk of suicide and disability [13]. In the epidemiological study, people who respond quickly to
interviews are happier and have a better mood. They are very positive about their future, which makes
clear that they are not depressed and that their depression scale is 0 [4]. If depression is left
untreated for a long time, we may fall behind in areas such as learning, intelligence quotient (IQ),
skill development, and quick work [14].

A statistical report says that almost 4.67 billion people all over the world use the internet and search
for information for various reasons [15]. Much research is being done on depression detection. For
detecting depression, datasets can be collected from different types of social media such as Facebook,
Twitter, Reddit, blogs, live journals, and Instagram. Twitter is one of the most popular social
networking sites, with 326 million active users [12]. Smys and Raj [13] said that non-depressed people
use Twitter as a tool for gathering information, whereas depressed people use it as a tool for social
awareness. Reddit is also a popular social media site. Choudhury et al. [16] analyzed the posts of Reddit
users and identified depressive or suicidal ideation. Their predictive features were self-concern, poor
linguistic quality, hopelessness, and anxiety. Tadesse et al. [12] predicted depression from Reddit
forums using natural language processing (NLP) and machine learning (ML) classifiers. They studied 1,293
depression-indicative posts and 548 standard posts, and a support vector machine (SVM) gave them 90%
accuracy in identifying depression from words related to different feelings.

Alsagri and Ykhlef [6] collected 300,000 tweets from 111 user profiles. They applied various classifiers,
and SVM showed the best accuracy, 82.5%, in identifying depression. Chatterjee et al. [17] collected a
set of 7,146 Facebook comments and created a dictionary of 8,220 words for different emotions. They found
more than 3,000 people with depressive comments. They applied the Naive Bayes (NB) classifier to get the
result. Wu et al. [18] collected data from 24 adults and used 27 pictures to recognize emotion. They
worked with pictures to detect depression with SVM. Smith et al. [19] interviewed 387 pregnant women, and
only 26% were screened for anxiety disorder. Arachchige et al. [15] identified 1,335 references from
different datasets and used ML and NLP for depression detection. They used Facebook, Twitter, and Reddit
forums. Jung et al. [20] analyzed different approaches to social media posts and mapped frequently asked
questions (FAQ) onto ontology concepts to detect depression from social media. Choudhury et al. [16]
analyzed social media content for their study. They studied the mental health of college students at
almost 100 universities, searching for depressive Reddit posts over the years.

Giuntini et al. [21] studied different types of social media, such as Twitter, Facebook, blogs, and
Reddit. The information they used was text, emoticons, and images. They applied classifiers and used
different tools to recognize depression. Tummala et al. [22] used ML algorithms such as SVM, random
forest (RF), logistic regression (LR), and NB. They collected a large amount of data from Facebook
comments, depressed and non-depressed Reddit posts, Reddit users, and Twitter. Most of the data were
collected from Twitter. Priya et al. [23] applied several algorithms based on five severity levels,
namely normal, mild, moderate, severe, and extremely severe. They collected data using a standard
questionnaire measuring the common symptoms of depression. Although the accuracy of NB was higher, RF was
identified as the best model. Sau and Bhakta [24] collected 470 records from medical college and hospital
students. They collected socio-demographic and occupational health-related information. CatBoost, LR, NB,
RF, and SVM were used, where CatBoost is a state-of-the-art algorithm for boosting decision trees (DTs),
and it achieved the best accuracy.
Narziev et al. [5] detected depression using smartwatches and smartphones. They built short-term
depression detection (STDD), an Android application that records a person's daily routine, and extracted
five factors of depression (physical activity, mood, social activity, sleep, and food intake). They
collected data from 20 students in four depression groups (normal, mildly depressed, moderately
depressed, and severely depressed) through the patient health questionnaire (PHQ-9). Ilgen et al. [2]
identified the rates of suicide among patients treated for depression. They worked with 887,859 VA
patients and developed DTs to measure the risk of depression. They found a suicide rate of 89.55 per
100,000 people per year. Cacheda et al. [11] collected 500,000 different posts and comments from Reddit
and determined which writings could be considered depressive. They concluded that depressed people are
more interested in replying to existing posts or issues than in publishing new ones, and they described
the differences between depressed and non-depressed writings. They found major depressive disorder (MDD)
in 15% of their dataset. Saqib et al. [25] analyzed 1,392 articles and applied ML and big data for
predicting postpartum depression (PPD). ML algorithms are capable of analyzing very large datasets, and
it is better to detect PPD at an early stage. Haque et al. [14] collected a dataset from Young Minds
Matter (YMM). They identified 11 important symptoms (such as unhappy, nothing fun, and irritable mood) to
detect depression among children and adolescents. If any five of these 11 symptoms are present in
someone, that person is considered depressed. RF was able to predict depression at a rate of 99% in 315
milliseconds.

Trotzek et al. [26] built a dataset from 887 Reddit users containing 10 to 2,000 documents. Their goal
was to classify posts for depression in chronological order, and they obtained results from five models
based on bag of words (BOW), paragraph vectors, latent semantic analysis (LSA), recurrent neural networks
(RNN), and long short-term memory (LSTM). Low et al. [27] concluded that teen suicide is one of the
leading causes of death in Australia. Most suicides are linked to depressive disorders and symptoms. To
understand and prevent depression and suicide in teens, they carried out several studies. Kumar et al.
[3] noted that anxiety-depressive disorder (AD) is predominantly associated with erratic thought
processes, insomnia, and restlessness. They proposed an AD prediction model for real-time tweets, which
achieved a classification accuracy of 85.09% for tweets of 100 sampled users. Smys and Raj [13] tested
over 2,500 sentences for emotion prediction from a Twitter dataset. Their validation on the dataset gave
higher accuracy than existing individual classifiers for early detection, since emotion recognition and
depression detection are difficult tasks for single classifiers. Islam et al. [1] showed that all these
methods of separation are based on linguistic style, emotional process, and temporal process, and that
all of these aspects must be considered to effectively capture the effect of depressive emotions. They
used 21 types of linguistic inquiry and word count (LIWC) software for stress detection. Stankevich et
al. [28] explored different sets of features for the task of depression detection in social media. Islam
[29] used ML techniques for depression detection, evaluated using a set of psycholinguistic and textual
features. The results show that, across different experiments, the DT gives the highest accuracy. Dey et
al. [8] used many algorithms in an analytical study of depression in a social media context. Supervised
classification algorithms are widely used in ML, and they are highly scalable and very popular for text
categorization. Bailey and Plumbley [30] examined gender bias in the DAIC-WOZ dataset, which contains 189
interviewed participants. Through the depression detection questionnaire (PHQ-8), 57 participants were
found to have posttraumatic stress and 132 were not. They found that females are more depressed: the
ratio of depressed to non-depressed is 5:8 for females, whereas it is 2:7 for males.

Data mining is the extraction of underlying, previously unknown, and potentially useful knowledge from
data in large data repositories. Nowadays data mining and ML play a vital role in detecting depression.
Our study aims to identify whether someone is depressed using data mining and ML (DMML). We collect
information from different sources and apply data mining to obtain a better solution and detect
depressive users on social media. Generally, ML is a method of analyzing data. Here it analyzes the data
from our survey and predicts someone's mental health. This paper is organized as follows: section 2
presents the research methodology, including a description of the dataset, the implementation process,
and the classifiers. Section 3 discusses the experimental results, other findings, and the comparative
analysis. Finally, section 4 provides the conclusion and future plans.


2. METHOD 

Our study utilized data mining techniques to analyze data and to predict whether it was depressive or
non-depressive. This section explains the method applied to identify depressive posts or comments on
social media. Data preprocessing, data extraction, text processing, and classification are performed step
by step. Of almost ten thousand data items in the Bangla language collected from Facebook posts and
comments, we used 80% for training and the remaining 20% for testing. The data were then converted into
numeric features for ML with a term frequency (TF) and inverse document frequency (IDF) vectorizer. After
that, we applied RF, LR, DT, SVM, K-nearest neighbors (KNN), and multinomial NB classifiers to predict
depressive posts or comments.


2.1. Implementation process

To implement this work, we performed several steps, which are presented in Figure 1. We built a dataset
of 10,000 individual items labeled as depression or non-depression, collected from single-line Facebook
texts, comments, or posts. We then preprocessed the dataset. We added an extra column named "spam" that
numerically encodes the "Sentimental" column: 1 for depression and 0 for no depression. All duplicate
data were then deleted; around three hundred duplicates were found in our dataset. After that, we split
the data into two parts, one for training the ML models and the other for testing the results: 80% of the
data was used for training and 20% for testing.

Feature engineering is the process of creating features that ML algorithms use to find patterns. The
features are generated to extract information in a form that ML algorithms can understand and that helps
in prediction. In the data preprocessing step, we used TF and IDF for feature extraction. Using both, we
converted our dataset into numeric vectors for ML.

TF-IDF is a numerical statistic intended to reflect how important a word is to a document in a
collection. It is a common weighting factor in text mining and text processing. TF counts the number of
times each term occurs in each data item, and IDF is incorporated to diminish the weight of terms that
occur very frequently across the dataset. TF-IDF thus compares a term's frequency within one item against
its frequency across the whole collection, which helps to emphasize topically relevant terms. In our work
it is used only to encode the data, converting our string data into numeric values.
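
As an illustration of the weighting described above, the following sketch computes TF-IDF by hand with
the textbook form tf × log(N / df). It is an assumption-laden example rather than the vectorizer used in
the study: library implementations often apply smoothing and normalization, so their numbers will differ,
and the function name `tf_idf` and the English sample sentences are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term frequency within this document
        vectors.append({
            term: count * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical sample documents.
docs = ["i feel sad today", "i feel happy today", "sad sad day"]
vectors = tf_idf(docs)
for doc, vector in zip(docs, vectors):
    print(doc, "->", {term: round(weight, 3) for term, weight in vector.items()})
```

In this toy collection, "happy" appears in only one document and therefore receives a higher weight than
"i", which appears in two of the three documents.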

Figure 1. Proposed framework


2.2. Classifier description 

In this section, we describe the six classifiers we applied: RF, LR, DT, SVM, KNN, and multinomial NB.
These classifiers are among the most popular algorithms in ML. A classifier is an algorithm that assigns
our data to one of a set of classes.


2.2.1. Naïve Bayes 

The NB classifier is regarded as one of the most straightforward methods in ML. It assumes that the
features are independent [14]. NB is a simple and efficient classifier for creating probabilistic models
[18]. We used multinomial NB in our study. Figure 2 shows the basic structure of NB.

Figure 2. The NB


2.2.2. Support vector machine

The SVM is a non-probabilistic linear binary classifier [22]. It is commonly used for recognition and
detection problems. This classifier performs well in prediction and is often better than other individual
classifiers [26]. Figure 3 shows a simple illustration of SVM.

Figure 3. The SVM


2.2.3. Random forest 

In this method, the dataset is divided among many individual trees [27]. It has clearer and more
understandable prediction rules, and the model becomes stronger as the number of trees increases [26]. RF
is mainly used to solve regression and classification problems. Figure 4 shows the basic structure of an
RF, which is built from many DTs.

Figure 4. The RF


2.2.4. Decision tree 

The DT is used to track emotion in emotion detection algorithms [26]. It is a nonparametric supervised
learning method that represents an algorithm as a series of conditional statements. It is also an
interpretable classifier that creates a hierarchical tree from the training instances [28]. Figure 5
shows the basic structure of the DT, which contains several nodes.


2.2.5. Logistic regression 

The LR uses a logistic function to predict binary outcomes. It models the relationship between a
dependent variable and the independent variables. Using it, we classify data with a decision boundary
[27]. Figure 6 shows the method of LR, which contains dependent and independent variables.
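
The role of the logistic function and the decision boundary can be illustrated with a short sketch. The
weights, bias, feature values, and the 0.5 threshold below are hypothetical values chosen for the example,
not parameters learned in this study.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features, threshold=0.5):
    """Return (probability of class 1, predicted label) for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    # The decision boundary is where the probability equals the threshold.
    return probability, int(probability >= threshold)

# Hypothetical two-feature model.
weights, bias = [2.0, -1.0], 0.0
prob_a, label_a = predict(weights, bias, [1.0, 0.5])  # z = 1.5
prob_b, label_b = predict(weights, bias, [0.0, 1.0])  # z = -1.0
print(round(prob_a, 3), label_a)
print(round(prob_b, 3), label_b)
```

Inputs whose weighted sum is positive fall on the class 1 side of the boundary, and inputs with a negative
sum fall on the class 0 side.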


2.2.6. K-nearest neighbor

The KNN algorithm is a non-parametric supervised method used for regression and classification. For each
test item, it calculates the distance between the test item and the training points and selects the K
training points closest to it. For classification, the test item is assigned the majority class among
those K neighbors. Figure 7 shows the simple structure of KNN.
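
A minimal sketch of this procedure is given below, assuming Euclidean distance and a majority vote among
the K nearest training points. The function name `knn_predict`, the value K = 3, and the two-dimensional
points are hypothetical; in the study the points would be TF-IDF vectors.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D training data: label 0 near the origin, label 1 near (1, 1).
points = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9), (0.2, 0.2)]
labels = [0, 0, 1, 1, 0]
print(knn_predict(points, labels, (0.1, 0.1)))
print(knn_predict(points, labels, (0.95, 0.95)))
```

An odd K is a common choice for two-class problems because it avoids tied votes.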

Figure 7. The KNN


3. RESULTS AND DISCUSSION 
3.1. Evaluation and measures 

These automated predictions are evaluated through accuracy, precision, recall, F1 score, confusion matrix
(CM), and curves, which are defined as follows. Accuracy tells us how often the ML model was correct
overall. Precision is the fraction of items predicted as a particular class that actually belong to that
class. Recall is the fraction of items of a specific class that the model correctly identifies. The F1
score is the harmonic mean of precision and recall, so it summarizes the model's performance on a dataset
in one number. The F1 score is perfect when it is 1 and a total failure when it is 0. A CM is a tabular
summary of actual and predicted values, showing the number of correct and incorrect predictions made by a
classifier.
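
These measures can all be derived from the four cells of a binary CM, as the following sketch shows. The
counts used here are hypothetical and are not taken from our experiments; the function name
`classification_metrics` is likewise illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: true positives, false positives, false negatives, true negatives.
m = classification_metrics(tp=80, fp=24, fn=20, tn=76)
print({name: round(value, 3) for name, value in m.items()})
```

The same harmonic-mean formula can be checked against reported values: a precision of 0.77 and a recall of
0.80, as listed for SVM on the depression class in Table 1, give an F1 score of about 0.78.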


3.2. Result

Our experiment was conducted with TF-IDF feature extraction and various classification algorithms (SVM,
DT, LR, RF, NB, and KNN). Our training and testing samples consist of depressed and non-depressed data.
To obtain optimal classification results, the TF-IDF feature combination is exploited. Table 1 summarizes
the precision, recall, and F1 score for all algorithms, measured separately for depressive and
non-depressive data. For the depression class, the SVM model obtained the highest precision, 0.77, with
an F1 score of 0.78, just below the 0.79 obtained by LR and NB. Table 2 shows the accuracy, sensitivity,
and specificity of each classifier.


Table 1. Precision rate, recall rate, and F1 score for each model

Models  Class           Precision  Recall  F1 measure
SVM     Depression      0.77       0.80    0.78
        Non-depression  0.73       0.69    0.71
LR      Depression      0.69       0.92    0.79
        Non-depression  0.82       0.47    0.60
DT      Depression      0.71       0.69    0.70
        Non-depression  0.62       0.63    0.62
RF      Depression      0.74       0.79    0.77
        Non-depression  0.71       0.65    0.68
NB      Depression      0.69       0.92    0.79
        Non-depression  0.82       0.47    0.60
KNN     Depression      0.66       0.77    0.71
        Non-depression  0.63       0.49    0.55


Table 2. Accuracy, sensitivity, and specificity of different classifiers

Measure      SVM (%)  LR (%)  DT (%)  RF (%)  NB (%)  KNN (%)
Accuracy     75.15    74.65   66.64   73.02   72.19   69.97
Sensitivity  72       71      61      72      81      62
Specificity  76       74      70      76      68      66


Figure 8 shows the comparison between the accuracy, sensitivity, and specificity of the different models.
Figure 9 shows the receiver operating characteristic (ROC) curve of the different models across all
classification thresholds in our study. This curve plots the true positive (TP) rate against the false
positive (FP) rate. NB is the best classifier on the ROC curve, LR and RF are slightly lower, and the
lowest rate was obtained from the DT.

Figure 8. Comparison of accuracy, sensitivity, and specificity of different classifiers

Figure 9. ROC curve for each classifier


3.3. Comparative analysis 

Table 3 shows a comparative analysis of recent studies on depression detection. They detected depression
in their data using different ML algorithms. Aldarwish and Ahmad [31] collected 6,713 posts and found 63%
accuracy with NB; they applied two ML algorithms in their study. Sau and Bhakta [24] collected data from
seafarers, applied five algorithms, and got the highest accuracy with CatBoost, a classifier that we did
not use in our study. Smys and Raj [13] used a hybrid classifier made of two classifiers (SVM and NB) to
improve accuracy. Many studies have been done on detecting depression from social media, since social
media is a main tool for expressing our emotions. Therefore, we present a comparative analysis between
previous models and our study, in which we collected around ten thousand data items, used six
classifiers, and got the highest accuracy with SVM.


Table 3. Comparison of previous studies for depression detection through social media

No.  Reference                 Data collection                 Used algorithm                               Best outcome
1    Aldarwish and Ahmad [31]  2,073 depressed posts,          NB, SVM                                      Accuracy 63% (NB)
                               4,700 non-depressed posts
2    Alsagri and Ykhlef [6]    500 Twitter users               DT, NB, SVM-L, SVM-R                         Accuracy 82% (SVM-L)
3    Islam [29]                7,145 Facebook comments         DT, SVM, KNN, ensemble                       Accuracy 72% (DT)
4    Smys and Raj [13]         2,500 sentences from Twitter    DT, RF, SVM, NB, hybrid classifier (SVM-NB)  Accuracy 92% (hybrid classifier)
5    Islam [29]                7,145 Facebook comments         DT, SVM, KNN, ensemble                       Accuracy 72% (DT)
6    Sau and Bhakta [24]       470 seafarers                   CatBoost, LR, SVM, NB, RF                    Accuracy 82.6% (CatBoost)


4. CONCLUSION 


Our study applies several ML algorithms to detect a person's sentiment through social media content such
as Facebook posts, comments, or any text. Social media is now a major platform for expressing a person's
feelings and inner thoughts. ML algorithms are capable of analyzing large sets of data, performing
computation on those data easily, and producing a prediction or informative result. In our study, we
applied ML algorithms to advance this research with a view to future work. Several steps were performed,
including data preparation, data labeling, feature extraction, and then the implementation of
classifiers. The classifiers successfully detected depression in about ten thousand posts and comments
collected from many kinds of people's profiles. ML is becoming more accessible to researchers, although
it can sometimes also be challenging for them. In future work, we plan to use another technique to
extract paraphrases, and we plan to use more datasets to make our study more efficient and effective.
More focused studies in depression analysis are needed. Our research can be extended in the future by
using deep learning methods to gain more accurate predictions of depression. We can also measure the
level or degree of a person's depression after detecting it and suggest some suitable steps to follow to
overcome it.